When interacting with mobile devices we expect relevant information to be presented with a minimum of input effort. What is relevant depends on the context in which we are acting.
If we request route information while sitting at a bus stop, we are most probably looking for a bus connection, while at a railway station we are most probably looking for a train connection.
One possibility for a device to identify the context is geolocation information. But this information may not be available inside buildings. An alternative approach is the analysis of ambient sound, referred to as acoustic scene classification.
Acoustic scene classification (ASC) describes the "capability of a human or an artificial system to understand an audio context, either from an on-line stream or from a recording" (http://www.cs.tut.fi/sgn/arg/dcase2016/documents/workshop/Valenti-DCASE2016workshop.pdf).
This workbook demonstrates how convolutional networks can be used for the classification task.
We will apply a pre-trained VGG-16 network with a custom classifier to log-frequency power spectrograms.
This will provide a trained model which can be used to classify new audio recordings.
The success can be measured by the prediction accuracy on a validation data set. Reaching 70 % accuracy would be a good value for making the results applicable.
This project uses recordings made available as part of the DCASE (Detection and Classification of Acoustic Scenes and Events) 2019 challenge (http://dcase.community/challenge2019/task-acoustic-scene-classification). The TAU Urban Acoustic Scenes 2019 development dataset contains recordings from 10 different settings (airport, indoor shopping mall, metro station, pedestrian street, public square, street traffic, tram, bus, metro, park) recorded in 10 cities. Each recording is 10 seconds long. The data files can be downloaded from https://zenodo.org/record/2589280 using data/download.sh.
This workbook assumes that the extracted audio files are in directory data/TAU-urban-acoustic-scenes-2019-development/audio/ relative to this notebook.
import IPython.display as ipd
import librosa, librosa.display
import math
import numpy as np
from collections import OrderedDict
import os
from PIL import Image, ImageDraw, ImageOps
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
import statistics
import torch
from torch import nn, optim
from torchvision import datasets, transforms, models
import warnings
import wave
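Before going on, a quick optional sanity check that the data is where this workbook expects it:
# Optional sanity check: the audio files should be in the directory below
# (see data/download.sh if they are missing).
assert os.path.isdir('data/TAU-urban-acoustic-scenes-2019-development/audio/'), \
    'Audio data not found - please download and extract it first'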
Let's analyze what the data looks like:
rawDataPath = 'data/TAU-urban-acoustic-scenes-2019-development/audio/'
def filenames2dataFrame(path):
"""Collect metadata of audio files
path - directory containing audio files
return - data frame with metadata
"""
data = []
for filename in os.listdir(path):
filepath = os.path.join(path, filename)
head, tail = os.path.split(filename)
parts = tail.split('-', 2)
with wave.open(filepath, "rb") as file:
sr = file.getframerate()
duration = file.getnframes() / sr
data.append( {
'setting' : parts[0],
'city' : parts[1],
'recording' : parts[2],
'duration' : duration,
'sr' : sr
})
return pd.DataFrame(data)
df = filenames2dataFrame(rawDataPath)
print('Durations are in the range {} - {} s.\n'.format(df['duration'].min(), df['duration'].max()))
print('Sample rates are in the range {} - {} Hz.\n'.format(df['sr'].min(), df['sr'].max()))
print('There are a total of {} recordings.\n'.format(df.shape[0]))
print('Number of recordings per category:\n')
print(df.set_index(['setting', 'city', 'duration', 'sr']).count(level='setting'))
print(df.set_index(['setting', 'city', 'duration', 'sr']).count(level='city'))
This looks fine:
Let's listen to one of the files.
sr = 48000
def readAudioFile(filename, sr=48000):
"""Reads an audio file with default sampling rate 48000Hz
filename - file to be read
return - numpy.float32
"""
x, sr = librosa.load(filename, sr=sr)
return x
x = readAudioFile(rawDataPath + 'street_pedestrian-lyon-1162-44650-a.wav')
print('number of samples {}'.format(x.shape[0]))
ipd.Audio(x, rate=sr)
To analyze the audio files we can transform them into spectrograms (cf. https://en.wikipedia.org/wiki/Spectrogram). These show the frequency distribution for subsequent short time intervals.
A popular form of spectrograms are Mel spectrograms. The Mel scale is based on what humans perceive as equal pitch differences and defines how the frequency axis is scaled:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

As a result, for high frequencies the scale is proportional to the logarithm of the frequency, while low frequencies (especially below 700 Hz) are compressed.
This scale is widely used for speech analysis.
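As a quick numerical illustration of the formula (not needed for the rest of the workbook): one octave at low frequencies spans far fewer Mel units than one octave at high frequencies.
# Illustration only: Mel distance covered by one octave at different frequencies
def hz_to_mel(f):
    """Mel value for a frequency f in Hz (formula above)"""
    return 2595. * math.log10(1. + f / 700.)

for f in [100, 1000, 8000]:
    print('{:5d} Hz -> {:5d} Hz spans {:6.1f} mel'.format(
        f, 2 * f, hz_to_mel(2 * f) - hz_to_mel(f)))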
def melSpectrogram(x, sr=48000):
"""Draw mel spectrogram
x - samples
sr - sample rate
"""
hop_length = 1875 # This gives us 256 time buckets: 1875 = 10 * 48000 / 256
n_fft = 8192 # This sets the lower frequency cut off to 48000 Hz / 8192 * 2 = 12 Hz
S = librosa.feature.melspectrogram(x, sr=sr, n_fft=n_fft, hop_length=hop_length)
logS = librosa.power_to_db(abs(S))
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel power spectrum')
melSpectrogram(x)
Ambient sound can contain a lot of low-frequency components, e.g. from the engines of buses, trams, and metros.
These are exactly the frequencies that are compressed by the Mel scale.
When the running speed of a machine changes, much of its sound spectrum is shifted by the same factor. On a purely logarithmic frequency scale such a shift is a simple translation along the frequency axis, whereas the Mel scale distorts it for low frequencies.
So using a logarithmic scale for the analysis seems more appropriate.
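To make this concrete, here is a small numerical illustration (it assumes, as an example, a machine running 20 % faster and reuses the hz_to_mel helper defined above):
# Illustration only: a common factor on all frequencies is a constant shift
# on a logarithmic axis, but not on the Mel axis
factor = 1.2  # e.g. a machine running 20 % faster
for f in [50, 500, 5000]:
    shift_octaves = math.log2(factor)                 # independent of f
    shift_mel = hz_to_mel(factor * f) - hz_to_mel(f)  # depends on f
    print('{:5d} Hz: shift = {:.3f} octaves, {:5.1f} mel'.format(
        f, shift_octaves, shift_mel))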
Let's create a log-frequency power spectrogram (also referred to as constant-Q power spectrogram).
def logFrequencySpectrogram(x, sr=48000):
    """Compute a log-frequency (constant-Q) spectrogram in dB
    x - samples
    sr - sample rate
    return - spectrogram in dB
    """
    hop_length = 1024     # must be a multiple of 2^(number_of_octaves - 1)
    n_bins = 256          # number of frequency bins
    bins_per_octave = 24  # a quarter tone per bin
    fmin = 10             # 10 Hz; the upper end is fmin * 2^(n_bins / bins_per_octave), about 16 kHz
    C = librosa.cqt(x, sr=sr, fmin=fmin, hop_length=hop_length, n_bins=n_bins,
                    bins_per_octave=bins_per_octave)
    logC = librosa.power_to_db(np.abs(C))
    return logC
def drawLogFrequencySpectrogram(x, sr=48000):
logC = logFrequencySpectrogram(x, sr)
plt.figure(figsize=(15, 5))
librosa.display.specshow(logC, sr=sr, x_axis='time', y_axis='cqt_hz', fmin = 10, bins_per_octave = 24)
plt.colorbar(format='%+2.0f dB')
plt.title('Constant-Q power spectrum')
plt.show()
drawLogFrequencySpectrogram(x)
The high frequencies can be emphasized using a filter.
def preEmphasis(x, alpha=.97):
"""emphasise high frequencies
This filter mixes the signal with its first derivative:
y[n] = (x[n] - alpha * x[n-1]) / (1 - alpha)
Reference:
Haytham Fayek,
"Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral
Coefficients (MFCCs) and What's In-Between",
https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
The implementation by Haytham Fayek lacks a good first value. This
implementation uses x[-1] = 2 * x[0] - x[1].
Further to keep the amplitudes for low frequencies unchanged the output signal
is divided by (1 - alpha).
x - original signal
alpha - proportion of the output signal that comes from the first derivative
"""
return np.append((1 - 2 * alpha) * x[0] + alpha * x[1],
x[1:] - alpha * x[:-1]) / (1 - alpha)
p = preEmphasis(x, .97)
ipd.Audio(p, rate=sr)
drawLogFrequencySpectrogram(p)
The spectrogram can easily be converted to a black and white image. This is the input format that we will use for training the neural network.
def bwLogFrequencySpectrogram(x, sr=48000):
    """Create a log-frequency spectrogram as an image
    x - samples
    sr - sample rate
    return - spectrogram as 8-bit grayscale image
    """
    logC = logFrequencySpectrogram(x, sr)
    scaler = MinMaxScaler(feature_range=(0, 255))
    logC = scaler.fit_transform(logC)
    # Convert to an 8-bit grayscale image with low frequencies at the bottom
    img = Image.fromarray(logC.astype(np.uint8))
    img = img.transpose(Image.FLIP_TOP_BOTTOM)
    return img
def drawImage(img):
plt.figure(figsize=(12, 7))
plt.imshow(img, cmap=plt.cm.gray)
drawImage(bwLogFrequencySpectrogram(p))
The accompanying script convert.py can be used to convert all recordings to spectrograms:
python ./convert.py data/TAU-urban-acoustic-scenes-2019-development/audio/ data/spectrograms/
The script saves the data in a directory structure like
|- airport
||- barcelona
||- helsinki
|...
|- bus
||- barcelona
||- helsinki
...
In the following it is assumed that the spectrograms are available in directory data/spectrograms relative to this notebook.
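Conceptually, the conversion does something like the following sketch, which reuses the functions defined above (illustration only; the actual convert.py may differ in its details):
# Sketch of the conversion step - use convert.py for the real run
def convert_all(audio_dir, spectrogram_dir):
    for filename in sorted(os.listdir(audio_dir)):
        if not filename.endswith('.wav'):
            continue
        # File names look like <setting>-<city>-<recording>.wav
        setting, city, recording = filename[:-len('.wav')].split('-', 2)
        target_dir = os.path.join(spectrogram_dir, setting, city)
        os.makedirs(target_dir, exist_ok=True)
        x = readAudioFile(os.path.join(audio_dir, filename))
        img = bwLogFrequencySpectrogram(preEmphasis(x))
        img.convert('L').save(os.path.join(target_dir, recording + '.png'))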
# Let's check
img = Image.open('data/spectrograms/street_pedestrian/lyon/1162-44650-a.png')
drawImage(img)
# You should see the same image as above
The number of recordings per category is rather small. To avoid overfitting, data augmentation should be applied.
For image data a large variety of transformations can be used for augmentation. These include for instance random resized cropping, rotations, and flipping (see for instance https://github.com/aleju/imgaug).
Not all of them make sense for spectrograms, e.g. rotations. Reasonable transformations are:
class RandomAudioTransform:
"""Randomly transform short time spectrogram
The spectrogram is transformed randomly in three ways:
* The spectrogram frequencies are transposed randomly.
* The tempo is randomly changed.
* A random sampling window in time is chosen.
size - target image size of the spectrogram
octaves - number of octaves by which to shift spectrogram
    bins_per_octave - number of image pixels per octave
dilation - maximum relative increment of the tempo
sample_size - size of the time window in relation to the whole
spectrogram
random - True: use random values, False: replace random values by
fixed values
"""
def __init__(self, size=224, octaves=.5, bins_per_octave=24, dilation=0.25,
sample_size=.5, random=True):
"""Construct new RandomAudioTransform
"""
self.size = size
self.octaves = octaves
self.bins_per_octave = bins_per_octave
self.dilation = dilation
self.sample_size = sample_size
self.random = random
def rand(self):
"""Generate random number from interval [0., 1.[
"""
if self.random:
return random.random()
return .5
def __call__(self, img):
"""Transform a spectrogram provided as image
img - image to transform
Return - transformed image
"""
        # Stretch the time axis according to sample size and time dilation
width = int((1. + self.dilation * self.rand())
* self.size / self.sample_size)
img = img.resize(size=[width, img.size[1]], resample=Image.BICUBIC)
# Take sample from image
        # The vertical slack (image height minus crop size) limits the frequency shift
        alpha = self.octaves * self.bins_per_octave / (img.size[1] - self.size)
center = [self.rand(), (1 - alpha) * .5 + alpha * self.rand()]
img = ImageOps.fit(img, size=[self.size, self.size],
method=Image.BICUBIC, centering=center)
return img
# By default a random 5 second sample of the spectrogram shifted by up to .5 octaves is created.
# Just re-execute this cell to see the random changes.
transform = RandomAudioTransform()
img = Image.open('data/spectrograms/street_pedestrian/lyon/1162-44650-a.png')
drawImage(transform(img))
pass
In "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", 2019, Park et al. suggest masking frequency bands for augmentation (https://ai.google/research/pubs/pub48482/, https://arxiv.org/pdf/1904.08779.pdf).
class FrequencyMask:
"""Randomly mask frequency band in short time spectrogram
    A randomly chosen horizontal band of the spectrogram image, i.e. a band of
    frequencies, is replaced by a constant gray value.
max_width - maximum portion of all frequencies that will be masked
"""
def __init__(self, max_width = .2):
"""Construct new FrequencyMask
"""
self.max_width = max_width
    def __call__(self, img):
        """Transform a spectrogram provided as an image
img - image to transform
Return - transformed image
"""
width, height = img.size
mask_height = height * self.max_width * random.random()
ymin = math.floor((height - mask_height) * random.random())
ymax = math.floor(mask_height + ymin)
draw = ImageDraw.Draw(img)
draw.rectangle(((0, ymin), (width - 1, ymax)), fill=(127, 127, 127))
return img
# A random frequency band is masked.
# Just re-execute this cell to see the random changes.
transform = FrequencyMask()
img = Image.open('data/spectrograms/street_pedestrian/lyon/1162-44650-a.png')
drawImage(transform(img))
pass
In machine learning we should use separate data sets for training, validation and testing. The accompanying script split.py executes the split using a 60:20:20 ratio for each setting-city combination:
python ./split.py data/spectrograms data/splitted
In the following it is assumed that the split data set is available in data/splitted.
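For illustration, the core of such a split could look like the sketch below; the actual split.py may handle details like seeding differently.
# Sketch of a 60:20:20 split for the file names of one setting-city combination
# (illustration only - use split.py for the real run)
def splitFilenames(filenames, seed=0):
    filenames = sorted(filenames)
    random.Random(seed).shuffle(filenames)
    n_train = int(.6 * len(filenames))
    n_valid = int(.2 * len(filenames))
    return {'train': filenames[:n_train],
            'valid': filenames[n_train:n_train + n_valid],
            'test':  filenames[n_train + n_valid:]}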
# Let's check (assuming Linux as operating system). The result should be close to 8640, 2880, 2880
print("train: {}".format(os.popen("find data/splitted/train -name '*.png' | wc -l").read()))
print("valid: {}".format(os.popen("find data/splitted/valid -name '*.png' | wc -l").read()))
print("test: {}".format(os.popen("find data/splitted/test -name '*.png' | wc -l").read()))
Using a pre-trained network as a starting point greatly simplifies finding a good solution. One example of a pre-trained network for image recognition is VGG16 (Karen Simonyan, Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", 2014, https://arxiv.org/abs/1409.1556), which is delivered with the torchvision package.
I have tested both VGG16 and VGG19 in this project but found no benefit in using VGG19 which has some additional layers.
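To see which classifier the pre-trained model comes with (and why its first Linear layer determines the in_features used further below), we can simply print it:
# Optional: inspect the classifier part of the pre-trained VGG16
print(models.vgg16(pretrained=True).classifier)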
def create_transforms(octaves=.5, bins_per_octave=24, dilation=0.25,
sample_size=.5, max_width=.2):
"""Create the train and test tranforms
octaves - number of octaves by which to shift spectrogram
    bins_per_octave - number of image pixels per octave
dilation - maximum relative increment of the tempo
sample_size - size of the time window in relation to the whole
spectrogram
max_width - maximum portion of all frequencies that will be masked
return - training and testing/validation transforms
"""
# This normalization has been used for pre-training the VGG16 model.
transform_norm = transforms.Normalize(
[0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
train_transforms = transforms.Compose([RandomAudioTransform(
octaves=octaves, bins_per_octave=bins_per_octave, dilation=dilation,
sample_size=sample_size, random=True),
FrequencyMask(max_width),
transforms.ToTensor(),
transform_norm])
test_transforms = transforms.Compose([RandomAudioTransform(
random=False),
transforms.ToTensor(),
transform_norm])
return train_transforms, test_transforms
data_dir = 'data/splitted'
def create_data_loaders(train_transforms, test_transforms, data_dir='data/splitted'):
"""Data loaders
data_dir - directory with data
train_transforms - transforms for training
test_transforms - transforms for validation and testing
"""
train_dir = os.path.join(data_dir, 'train')
test_dir = os.path.join(data_dir, 'test')
valid_dir = os.path.join(data_dir, 'valid')
train_data = datasets.ImageFolder(train_dir, transform=train_transforms)
valid_data = datasets.ImageFolder(valid_dir, transform=test_transforms)
test_data = datasets.ImageFolder(test_dir, transform=test_transforms)
class_to_idx = test_data.class_to_idx
# The batch sizes below fit into the 6GB of an Nvidia GTX 1060.
train_loader = torch.utils.data.DataLoader(
train_data, batch_size=32, shuffle=True)
valid_loader = torch.utils.data.DataLoader(
valid_data, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size=32, shuffle=True)
return train_loader, test_loader, valid_loader, class_to_idx
def get_in_features(model):
"""Get the number of in_features of the classifier
"""
in_features = 0
for module in model.classifier.modules():
try:
in_features = module.in_features
break
except AttributeError:
pass
return in_features
def create_classifier(model, out_features, hidden_units=512):
"""Create the classifier
"""
classifier = nn.Sequential(OrderedDict([
('fc1', nn.Linear(get_in_features(model), hidden_units)),
('relu1', nn.ReLU(inplace=True)),
('drop1', nn.Dropout(.5)),
('fc2', nn.Linear(hidden_units, hidden_units)),
('relu2', nn.ReLU(inplace=True)),
('drop2', nn.Dropout(.5)),
('fc3', nn.Linear(hidden_units, out_features)),
('output', nn.LogSoftmax(dim=1))
]))
model.classifier = classifier
def load_model(out_features, hidden_units, classifier_only = True):
""" Load the pretrained model
    out_features - number of categories to predict
    hidden_units - number of units in the classifier's hidden layers
    classifier_only - update the classifier parameters only
"""
method_to_call = getattr(models, 'vgg16')
model = method_to_call(pretrained=True)
if classifier_only:
# Do not update model parameters
for param in model.parameters():
param.requires_grad = False
# Add our own classifier
create_classifier(model, out_features=out_features,
hidden_units=hidden_units)
return model
def check_accuracy(model, test_loader, device='cuda', mute=True, print_every=40):
"""Check the accuracy of the model"""
correct = 0
total = 0
steps = 0
expected = []
actual = []
model.to(device)
model.eval()
with torch.no_grad():
for inputs, labels in test_loader:
steps += 1
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
_, predicted = torch.max(outputs.data, 1)
for label in labels.tolist():
expected.append(label)
for prediction in predicted.tolist():
actual.append(prediction)
total += labels.size(0)
correct += (predicted == labels).sum().item()
if not mute:
print('-', end='', flush=True)
if steps % print_every == 0:
print()
accuracy = 100 * correct / total
if not mute:
print('\nAccuracy: {:.4f} %'.format(accuracy))
print('Confusion matrix')
print(confusion_matrix(expected, actual))
return accuracy
def save_checkpoint(path, model, optimizer, class_to_idx):
"""Save a checkpoint"""
model.class_to_idx = class_to_idx
warnings.filterwarnings("ignore", category=UserWarning)
torch.save({'model': model,
'optimizer': optimizer}, path)
warnings.filterwarnings("default", category=UserWarning)
def do_deep_learning(model, criterion, optimizer, train_loader, valid_loader,
class_to_idx, epochs=1, device='cuda', print_every=40,
mute=True):
"""Execute training steps
Training is discontinued if there is no improvement for 10 epochs.
"""
model.to(device)
model.train()
best_accuracy = 0.0
worse_count = 0
for epoch in range(epochs):
running_loss = 0
steps = 0
for inputs, labels in train_loader:
steps += 1
inputs, labels = inputs.to(device), labels.to(device)
optimizer.zero_grad()
# Forward and backward passes
outputs = model.forward(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
if not mute:
running_loss += loss.item()
print('.', end='', flush=True)
if steps % print_every == 0:
print("Epoch: {}/{}, ".format(epoch + 1, epochs),
"Loss: {:.4f}".format(running_loss/print_every))
running_loss = 0
if not mute:
while steps % print_every != 0:
print(' ', end='')
steps += 1
print("Epoch {} completed".format(epoch + 1))
accuracy = check_accuracy(model, valid_loader, mute=mute)
if mute:
print("*", end='')
if accuracy > best_accuracy:
best_accuracy = accuracy
worse_count = 0
# To avoid disk wear the checkpoint is saved in the
# /tmp directory which hopefully is a RAM disk.
save_checkpoint('/tmp/checkpoint.pt', model, optimizer, class_to_idx)
else:
worse_count += 1
if worse_count == 10:
if not mute:
print("No improvement for 10 epochs")
break
print()
return best_accuracy
def run_training(octaves=.5, bins_per_octave=24, dilation=0.25,
sample_size=.5, max_width = .2,
hidden_units=512, epochs=200, mute=True):
""" Run training for given hyper parameters
octaves - number of octaves by which to shift spectrogram
    bins_per_octave - number of image pixels per octave
dilation - maximum relative increment of the tempo
sample_size - size of the time window in relation to the whole
spectrogram
max_width - maximum portion of all frequencies that will be masked
    hidden_units - number of units in the classifier's hidden layers
    epochs - maximum number of training epochs
    return - best accuracy, trained model, test loader, test transforms and
             class-to-index mapping
"""
train_transforms, test_transforms = \
create_transforms(octaves=octaves, bins_per_octave=bins_per_octave, dilation=dilation,
sample_size=sample_size, max_width = max_width)
train_loader, test_loader, valid_loader, class_to_idx = \
create_data_loaders(train_transforms, test_transforms)
model = load_model(len(class_to_idx), hidden_units)
optimizer = optim.Adam(model.classifier.parameters(), lr=.001)
criterion = nn.NLLLoss()
accuracy = do_deep_learning(model, criterion, optimizer, epochs=epochs,
train_loader=train_loader, valid_loader=valid_loader,
class_to_idx=class_to_idx, mute=mute)
return accuracy, model, test_loader, test_transforms, class_to_idx
# To make this reproducible let's use fixed random seeds
torch.manual_seed(8259)
random.seed(6745)
accuracy, model, test_loader, test_transforms, class_to_idx = run_training(epochs=200,mute=False)
print('Achieved accuracy: {}'.format(accuracy))
The best fitting model has been saved as a checkpoint. Copy it next to this notebook, load it, and check the accuracy against the test data set.
# To avoid disk wear the checkpoints have been saved in the /tmp directory
# which hopefully is a RAM disk. Copy the current checkpoint to our directory.
os.popen("cp /tmp/checkpoint.pt checkpoint.pt 2>&1 && echo ok").read()
def load_checkpoint(path='checkpoint.pt'):
"""Reload model from checkpoint
"""
checkpoint = torch.load(path)
model = checkpoint['model']
return model
# Check the accuracy using the test data
model = load_checkpoint('checkpoint.pt')
class_to_idx = model.class_to_idx
check_accuracy(model, test_loader, mute=False)
pass
The columns of the confusion matrix correspond to the predicted categories, the rows to the actual categories. The labels for the rows and columns can be found below.
So the noise inside a bus is easily recognized, but the sound of a pedestrian street is easily mistaken for a public square.
class_to_idx
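The rows and columns of the confusion matrix are ordered by class index, so the labels can be listed in matrix order like this:
# List the class labels in the order used by the confusion matrix
idx_to_class = {idx: name for name, idx in class_to_idx.items()}
print([idx_to_class[i] for i in range(len(idx_to_class))])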
Let's use the model to make a prediction
class Predict:
"""Image classifier"""
def __init__(self):
"""Constructor"""
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.image = None
self.category_names = None
self.top_k = 5
def classify(self, image_file_name, checkpoint_file_name,
category_names_file_name = None):
"""classify an image"""
self.load_checkpoint(checkpoint_file_name)
self.load_image(image_file_name)
if category_names_file_name is not None:
self.load_category_names(category_names_file_name)
probs, categories = self.infer(self.top_k)
self.output(probs, categories)
self.draw_prediction(probs, categories)
def infer(self, top_k):
"""Infer the classes"""
self.model.to(self.device)
self.model.eval()
inputs = torch.stack((self.image,))
inputs = inputs.to(self.device)
with torch.no_grad():
outputs = self.model(inputs)
outputs = outputs.to("cpu")
probs, indices = outputs.topk(top_k)
probs = probs.exp()
probs = probs.tolist()[0]
indices = indices.tolist()[0]
categories = [self.idx_to_class[index] for index in indices]
return probs, categories
def load_checkpoint(self, checkpoint_file_name):
"""Load checkpoint from file"""
checkpoint = torch.load(checkpoint_file_name, map_location={'cuda:0': 'cpu'})
self.model = checkpoint['model']
self.criterion = nn.NLLLoss()
class_to_idx = self.model.class_to_idx
self.idx_to_class = {value : key
for key, value in class_to_idx.items()}
def load_image(self, image_file_name):
"""Load image from file"""
self.image = Image.open(image_file_name).convert('RGB')
self.image = self.normalize_image(self.image)
def normalize_image(self, image):
"""Normalize image"""
return test_transforms(image)
    def output(self, probs, categories):
        """Output category names and probabilities"""
if self.category_names is not None:
categories = [self.category_names[category]
for category in categories]
category_title = 'Category Name'
else:
category_title = 'Category'
max_len = max([len(category) for category in categories])
max_len = max(max_len, len(category_title))
        print('{:>{}} | {}'.format(category_title, max_len, 'Probability'))
print('{:>{}}-+-{}'.format('-' * max_len, max_len, '-----------'))
for i in range(len(probs)):
print('{:>{}} | {:.4f}'.format( categories[i], max_len, probs[i]))
    def draw_prediction(self, probabilities, categories,
true_category = '', title = 'Prediction'):
"""Draw a plot showing the prediction"""
fig, plot = plt.subplots(nrows=1, ncols=1, figsize=(7,5))
fig.tight_layout()
fig.suptitle(title, fontsize=20, y=1.05)
margin = 0.05
n_cat = len(categories)
ind = np.arange(n_cat)
width = (1. - 2. * margin) / n_cat
        plot.barh(ind, probabilities[::-1], height = .8)
plot.set_yticks(ind + margin)
plot.set_yticklabels(categories[::-1], fontsize=16)
        plot.set_xlabel('probability')
def set_device(self, device_name):
"""Set the cuda device"""
self.device = torch.device(device_name)
def set_top_k(self, top_k):
"""Set number of categories to output"""
self.top_k = top_k
# Can you identify the sound?
x = readAudioFile(rawDataPath + 'bus-milan-1115-42136-a.wav')
ipd.Audio(x, rate=sr)
# Check your guess against the prediction.
predict = Predict()
predict.classify('data/spectrograms/bus/milan/1115-42136-a.png', 'checkpoint.pt', None)
The parameters chosen for the model above provide a reasonable accuracy. But can we do better?
With limited project time, let's concentrate on two of the hyper parameters that control the augmentation.
def refine():
"""Optimize hyper parameters"""
# If we have enough computation time, we should use multiple runs with different
# random seeds per run to get a more reliable measure for the accuracy.
repetitions = 3
    # To make this reproducible let's use fixed random seeds
seeds = [8259, 6745, 14986, 4215];
best_accuracy = 0
best_parameters = {}
best_testloader = None
best_test_transforms = None
sample_sizes = [.3, .5, .7]
max_widths = [0., .1, .2]
for sample_size in sample_sizes:
for max_width in max_widths:
accuracies = []
for i in range(repetitions):
torch.manual_seed(seeds[i])
random.seed(seeds[i])
accuracy, model, test_loader, test_transforms, class_to_idx = \
run_training(sample_size=sample_size, max_width=max_width,
hidden_units=512, epochs=200, mute=True)
accuracies.append(accuracy)
average_accuracy = statistics.mean(accuracies)
if (average_accuracy > best_accuracy):
best_accuracy = average_accuracy
best_parameters= {
'sample_size' : sample_size,
'max_width' : max_width
}
os.popen("cp /tmp/checkpoint.pt /tmp/best_checkpoint.pt 2>&1")
print('new best');
print('sample_size = {}, max_width = {}, average_accuracy = {:.4f} +/- {:.4f} %'.format(
sample_size, max_width, average_accuracy, statistics.stdev(accuracies)))
return best_parameters, best_accuracy
best_parameters, best_accuracy = refine()
print('best parameters: {}'.format(best_parameters))
print('best accuracy: {:.4f} %'.format(best_accuracy))
os.popen("cp /tmp/best_checkpoint.pt best_checkpoint.pt 2>&1")
pass
# Check the accuracy using the test data
model = load_checkpoint('best_checkpoint.pt')
class_to_idx = model.class_to_idx
check_accuracy(model, test_loader, mute=False)
pass
The accuracy is improved slightly. But given the standard deviations observed in the grid search, the difference is within the error margin.
For the different augmentation techniques applied, the grid search results indicate:
The project folder contains two standalone Python programs, convert.py and split.py.
A workflow for classifying ambient sound was demonstrated:
Though the network used was not specifically built for this classification task, respectable accuracy rates were achieved.
Directions for further investigation could be